﻿<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta><style>/*<![CDATA[*/

table{border: 1px solid gray;}
td{border: 1px dotted gray;}
p{margin: 3px 0 3px 0; padding: 0;}
#ID_Footer{font-size: small; font-style: italic; text-align: right; margin-top: 4em; padding-top: 4px; border-top: 2px solid gray;}

/*]]>*/</style><title>Web Crawlers</title></head><body>
<div><span style="font-family: 微软雅黑; font-size: 9pt; color: #000000; line-height: 140%">Zhihu: How do I get started with Python web crawlers? </span><a href="https://www.zhihu.com/question/20899988" style="font-family: 微软雅黑; font-size: 9pt; text-decoration: underline; color: #0000ff">https://www.zhihu.com/question/20899988</a></div>
<div><br /></div>
<div><span style="font-family: 微软雅黑; font-size: 9pt; line-height: 140%">What you need to learn:</span></div>
<div>
<ol>
<li><span style="font-family: 微软雅黑; font-size: 9pt">How a crawler basically works</span></li>
<li><span style="font-family: 微软雅黑; font-size: 9pt">A basic HTTP fetching tool: Scrapy</span></li>
<li><span style="font-family: 微软雅黑; font-size: 9pt">Bloom Filter: Bloom Filters by Example</span></li>
<li><span style="font-family: 微软雅黑; font-size: 9pt">If you need to crawl pages at scale, you need to learn the concept of a distributed crawler. It is less mysterious than it sounds: you only need to learn how to maintain a distributed queue that every machine in the cluster can share efficiently. The simplest implementation is python-rq: </span><a href="https://github.com/nvie/rq" style="font-family: 微软雅黑; font-size: 9pt; text-decoration: underline; color: #0000ff">https://github.com/nvie/rq</a></li>
<li><span style="font-family: 微软雅黑; font-size: 9pt">Combining rq with Scrapy: darkrho/scrapy-redis · GitHub</span></li>
<li><span style="font-family: 微软雅黑; font-size: 9pt">Post-processing: web page content extraction (grangier/python-goose · GitHub) and storage (MongoDB)</span></li></ol>
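<div><span style="font-family: 微软雅黑; font-size: 9pt; line-height: 140%">To make the first and third items in the list more concrete, here is a minimal sketch of a breadth-first crawl loop that uses a Bloom filter to remember which URLs it has already queued. Everything named here (BloomFilter, crawl, fetch_links, FAKE_WEB) is a hypothetical illustration and not part of Scrapy or any other library; the in-memory link graph stands in for real HTTP requests, and the filter uses one byte per slot for readability where a real one would pack bits.</span></div>
<pre>
```python
import hashlib
from collections import deque


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_slots=1024, num_hashes=3):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # One byte per slot keeps the sketch simple; real filters pack bits.
        self.slots = bytearray(num_slots)

    def _positions(self, item):
        # Derive num_hashes slot indexes by salting SHA-256 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, item):
        for pos in self._positions(item):
            self.slots[pos] = 1

    def __contains__(self, item):
        # "Probably present" only if every derived slot is set.
        return all(self.slots[pos] for pos in self._positions(item))


def crawl(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl; fetch_links(url) returns the URLs found on a page."""
    seen = BloomFilter()
    seen.add(start_url)
    frontier = deque([start_url])  # queue of URLs waiting to be fetched
    visited = []
    for _ in range(max_pages):
        if not frontier:
            break
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip URLs that were already queued
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for pages fetched over HTTP.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

if __name__ == "__main__":
    print(crawl("a", lambda url: FAKE_WEB.get(url, [])))
```
</pre>
<div><span style="font-family: 微软雅黑; font-size: 9pt; line-height: 140%">The trade-off is memory versus accuracy: the filter never forgets a URL it has stored, but it can occasionally report an unseen URL as seen, so a small fraction of pages may be skipped. In a distributed crawler, the in-process deque would be replaced by a queue that all machines share.</span></div>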
<div><br /></div>
<div><br /></div></div><script type="text/javascript" language="javascript" src="jquery.js"></script><script type="text/javascript" language="javascript" src="itemlink.js"></script></body></html>